Project - California Housing Price Prediction

DESCRIPTION

Problem Statement

  • The US Census Bureau has published California Census Data which has 10 types of metrics such as the population, median income, median housing price, and so on for each block group in California. The dataset also serves as an input for project scoping and tries to specify the functional and nonfunctional requirements for it.

Goal

  • The project aims at building a model of housing prices to predict median house values in California using the provided dataset. This model should learn from the data and be able to predict the median housing price in any district, given all the other metrics.

Data Snapshot

In [4]:
Image('californiascreenshot.png')
Out[4]:
In [5]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
from statsmodels.formula.api import ols
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.decomposition import PCA
import xgboost as xgb
from IPython.display import Image
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import reverse_geocoder as rg
import folium
In [6]:
dt=pd.read_csv('C:\\Users\\Subhasish Das\\Desktop\\SimplyLearn\\Project\\Kaggle\\California_Housing_Price_Prediction\\housing.csv')
In [7]:
dt.head()
Out[7]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity median_house_value
0 -122.23 37.88 41 880 129.0 322 126 8.3252 NEAR BAY 452600
1 -122.22 37.86 21 7099 1106.0 2401 1138 8.3014 NEAR BAY 358500
2 -122.24 37.85 52 1467 190.0 496 177 7.2574 NEAR BAY 352100
3 -122.25 37.85 52 1274 235.0 558 219 5.6431 NEAR BAY 341300
4 -122.25 37.85 52 1627 280.0 565 259 3.8462 NEAR BAY 342200
In [8]:
dt.shape
Out[8]:
(20640, 10)
In [9]:
dt.dtypes
Out[9]:
longitude             float64
latitude              float64
housing_median_age      int64
total_rooms             int64
total_bedrooms        float64
population              int64
households              int64
median_income         float64
ocean_proximity        object
median_house_value      int64
dtype: object
In [10]:
dt.describe()
Out[10]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

Feature-Engineering

In [11]:
dt.hist(bins=50,figsize=(20,15))
plt.show()

Checking Null Values

In [12]:
dt.isnull().sum()
Out[12]:
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
ocean_proximity         0
median_house_value      0
dtype: int64
In [13]:
dt.total_bedrooms.unique()
Out[13]:
array([ 129., 1106.,  190., ..., 3008., 1857., 1052.])
In [14]:
ax=dt.total_bedrooms.hist(bins=40)
In [15]:
mean_bedrooms=dt.total_bedrooms.mean()
In [16]:
mean_bedrooms
Out[16]:
537.8705525375618
In [17]:
median_bedrooms=dt.total_bedrooms.median()
In [18]:
median_bedrooms
Out[18]:
435.0
In [19]:
dt.fillna(median_bedrooms,axis=1,inplace=True)
In [20]:
dt.isnull().sum()
Out[20]:
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
ocean_proximity       0
median_house_value    0
dtype: int64
In [21]:
sns.pairplot(dt)
Out[21]:
<seaborn.axisgrid.PairGrid at 0x25155788b38>
In [22]:
plt.figure(figsize=(10,10))
sns.heatmap(dt.corr(),annot=True,cmap='coolwarm')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x25155f1a390>
In [23]:
def reverseGeocode(coordinates): 
    result = rg.search(coordinates)
    return (result)
In [24]:
if __name__=="__main__": 

    # Coordinates tuple.Can contain more than one pair.
    coordinates =list(zip(dt['latitude'],dt['longitude'])) # generates pair of (lat,long)
    data = reverseGeocode(coordinates)

    dt['name'] = [i['name'] for i in data]
    dt['admin1'] = [i['admin1'] for i in data]
    dt['admin2'] = [i['admin2'] for i in data]
Loading formatted geocoded file...
In [25]:
dt.head()
Out[25]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity median_house_value name admin1 admin2
0 -122.23 37.88 41 880 129.0 322 126 8.3252 NEAR BAY 452600 Berkeley California Alameda County
1 -122.22 37.86 21 7099 1106.0 2401 1138 8.3014 NEAR BAY 358500 Piedmont California Alameda County
2 -122.24 37.85 52 1467 190.0 496 177 7.2574 NEAR BAY 352100 Piedmont California Alameda County
3 -122.25 37.85 52 1274 235.0 558 219 5.6431 NEAR BAY 341300 Berkeley California Alameda County
4 -122.25 37.85 52 1627 280.0 565 259 3.8462 NEAR BAY 342200 Berkeley California Alameda County
In [26]:
dt.rename(columns={'name':'City','admin1':'State','admin':'County'},inplace=True)
In [27]:
latitude = 37.88
longitude = -122.23
traffic_map = folium.Map(location=[latitude, longitude], zoom_start=5)
In [28]:
colordict = {0: 'lightblue', 1: 'lightgreen', 2: 'orange', 3: 'red'}
In [29]:
for latitude, longitude, City,median_house_value in zip(dt['latitude'], dt['longitude'], dt['City'], dt['median_house_value']):
    
    folium.CircleMarker(
        [latitude, longitude],
        
        popup = ('City: ' + str(City).capitalize() + '<br>'
                 'median_house_value: ' + str(median_house_value)
                 
                ),
        color='b',
        
        
        fill=True,
        fill_opacity=0.7
        ).add_to(traffic_map)
    
display(traffic_map)
Make this Notebook Trusted to load map: File -> Trust Notebook
In [30]:
plt.figure(figsize=(20,25))
dt.plot(kind='scatter',x="longitude",y="latitude",alpha=0.4,
       s=dt['population']/100,label="Population",
       c='median_house_value',cmap=plt.get_cmap('jet'),colorbar=True)
plt.legend()
Out[30]:
<matplotlib.legend.Legend at 0x25178a4b588>
<Figure size 1440x1800 with 0 Axes>

Creating New Attribute

In [31]:
dt['Rooms_perHousehold']=dt['total_rooms']/dt['households']
In [32]:
dt['bedrooms_per_room']=dt['total_bedrooms']/dt['total_rooms']
In [33]:
dt['Population_per_household']=dt['population']/dt['households']
In [34]:
dt.head()
Out[34]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity median_house_value City State admin2 Rooms_perHousehold bedrooms_per_room Population_per_household
0 -122.23 37.88 41 880 129.0 322 126 8.3252 NEAR BAY 452600 Berkeley California Alameda County 6.984127 0.146591 2.555556
1 -122.22 37.86 21 7099 1106.0 2401 1138 8.3014 NEAR BAY 358500 Piedmont California Alameda County 6.238137 0.155797 2.109842
2 -122.24 37.85 52 1467 190.0 496 177 7.2574 NEAR BAY 352100 Piedmont California Alameda County 8.288136 0.129516 2.802260
3 -122.25 37.85 52 1274 235.0 558 219 5.6431 NEAR BAY 341300 Berkeley California Alameda County 5.817352 0.184458 2.547945
4 -122.25 37.85 52 1627 280.0 565 259 3.8462 NEAR BAY 342200 Berkeley California Alameda County 6.281853 0.172096 2.181467

Working with ocean_proximity

In [35]:
dt.ocean_proximity.unique()
Out[35]:
array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)
In [36]:
dt.ocean_proximity.value_counts()
Out[36]:
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

Now I will provide rank to each value of ocean_proximity column with respect to median_house_value

In [37]:
ordinal_labels=dt.groupby(['ocean_proximity'])['median_house_value'].mean().sort_values().index
In [38]:
ordinal_labels
Out[38]:
Index(['INLAND', '<1H OCEAN', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND'], dtype='object', name='ocean_proximity')
In [39]:
list(enumerate(ordinal_labels,0))
Out[39]:
[(0, 'INLAND'),
 (1, '<1H OCEAN'),
 (2, 'NEAR OCEAN'),
 (3, 'NEAR BAY'),
 (4, 'ISLAND')]
In [40]:
ordinal_labels2={k:i for i,k in enumerate(ordinal_labels,0)}
ordinal_labels2
Out[40]:
{'INLAND': 0, '<1H OCEAN': 1, 'NEAR OCEAN': 2, 'NEAR BAY': 3, 'ISLAND': 4}
In [41]:
dt['ocean_proximity_labels']=dt['ocean_proximity'].map(ordinal_labels2)
dt.head()
Out[41]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity median_house_value City State admin2 Rooms_perHousehold bedrooms_per_room Population_per_household ocean_proximity_labels
0 -122.23 37.88 41 880 129.0 322 126 8.3252 NEAR BAY 452600 Berkeley California Alameda County 6.984127 0.146591 2.555556 3
1 -122.22 37.86 21 7099 1106.0 2401 1138 8.3014 NEAR BAY 358500 Piedmont California Alameda County 6.238137 0.155797 2.109842 3
2 -122.24 37.85 52 1467 190.0 496 177 7.2574 NEAR BAY 352100 Piedmont California Alameda County 8.288136 0.129516 2.802260 3
3 -122.25 37.85 52 1274 235.0 558 219 5.6431 NEAR BAY 341300 Berkeley California Alameda County 5.817352 0.184458 2.547945 3
4 -122.25 37.85 52 1627 280.0 565 259 3.8462 NEAR BAY 342200 Berkeley California Alameda County 6.281853 0.172096 2.181467 3
In [42]:
dt.drop('ocean_proximity',axis=1,inplace=True)

Feature Selection

I will apply different techniques

1. Corelation

In [43]:
dt.corr()
Out[43]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value Rooms_perHousehold bedrooms_per_room Population_per_household ocean_proximity_labels
longitude 1.000000 -0.924664 -0.108197 0.044568 0.069120 0.099773 0.055310 -0.015176 -0.045967 -0.027540 0.081205 0.002476 -0.271730
latitude -0.924664 1.000000 0.011173 -0.036100 -0.066484 -0.108785 -0.071035 -0.079809 -0.144160 0.106389 -0.098619 0.002366 0.007695
housing_median_age -0.108197 0.011173 1.000000 -0.361262 -0.319026 -0.296244 -0.302916 -0.119034 0.105623 -0.153277 0.135622 0.013191 0.295012
total_rooms 0.044568 -0.036100 -0.361262 1.000000 0.927058 0.857126 0.918484 0.198050 0.134153 0.133798 -0.187381 -0.024581 -0.031586
total_bedrooms 0.069120 -0.066484 -0.319026 0.927058 1.000000 0.873535 0.974366 -0.007617 0.049457 0.001765 0.071649 -0.028325 -0.010067
population 0.099773 -0.108785 -0.296244 0.857126 0.873535 1.000000 0.907222 0.004834 -0.024650 -0.072213 0.010035 0.069863 -0.039415
households 0.055310 -0.071035 -0.302916 0.918484 0.974366 0.907222 1.000000 0.013033 0.065843 -0.080598 0.034498 -0.027309 0.012873
median_income -0.015176 -0.079809 -0.119034 0.198050 -0.007617 0.004834 0.013033 1.000000 0.688075 0.326895 -0.545298 0.018766 0.163755
median_house_value -0.045967 -0.144160 0.105623 0.134153 0.049457 -0.024650 0.065843 0.688075 1.000000 0.151948 -0.233303 -0.023737 0.397251
Rooms_perHousehold -0.027540 0.106389 -0.153277 0.133798 0.001765 -0.072213 -0.080598 0.326895 0.151948 1.000000 -0.370308 -0.004852 -0.106435
bedrooms_per_room 0.081205 -0.098619 0.135622 -0.187381 0.071649 0.010035 0.034498 -0.545298 -0.233303 -0.370308 1.000000 0.002601 0.061114
Population_per_household 0.002476 0.002366 0.013191 -0.024581 -0.028325 0.069863 -0.027309 0.018766 -0.023737 -0.004852 0.002601 1.000000 -0.019326
ocean_proximity_labels -0.271730 0.007695 0.295012 -0.031586 -0.010067 -0.039415 0.012873 0.163755 0.397251 -0.106435 0.061114 -0.019326 1.000000
In [44]:
plt.figure(figsize=(12,10))
cor = dt.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.CMRmap_r)
plt.show()
In [45]:
def correlation_feature(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

Settting the threshold value to 0.9

In [46]:
corr_features = correlation_feature(dt, 0.9)
len(set(corr_features))
Out[46]:
3
In [47]:
corr_features
Out[47]:
{'households', 'latitude', 'total_bedrooms'}

2. Finding zero variance column

In [48]:
zero_var=dt.var()[dt.var()==0].index.values
In [49]:
zero_var
Out[49]:
array([], dtype=object)

From the above corr_features I can drop {'bedrooms_per_room', 'households', 'latitude', 'total_bedrooms'} this columns
I also need to drop other object type column which has no co-relation withthe target variable

In [50]:
dt.dtypes
Out[50]:
longitude                   float64
latitude                    float64
housing_median_age            int64
total_rooms                   int64
total_bedrooms              float64
population                    int64
households                    int64
median_income               float64
median_house_value            int64
City                         object
State                        object
admin2                       object
Rooms_perHousehold          float64
bedrooms_per_room           float64
Population_per_household    float64
ocean_proximity_labels        int64
dtype: object
In [51]:
dt_housing=dt.copy()
In [52]:
dt.drop(['City','State','admin2','bedrooms_per_room', 'households', 'latitude', 'total_bedrooms'],axis=1,inplace=True)

Scalling of Data

In [53]:
min_dt=dt.min()
range_dt=(dt-min_dt).max()
dt_scaled = (dt-min_dt)/range_dt
In [54]:
dt_scaled.head()
Out[54]:
longitude housing_median_age total_rooms population median_income median_house_value Rooms_perHousehold Population_per_household ocean_proximity_labels
0 0.211155 0.784314 0.022331 0.008941 0.539668 0.902266 0.043512 0.001499 0.75
1 0.212151 0.392157 0.180503 0.067210 0.538027 0.708247 0.038224 0.001141 0.75
2 0.210159 1.000000 0.037260 0.013818 0.466028 0.695051 0.052756 0.001698 0.75
3 0.209163 1.000000 0.032352 0.015555 0.354699 0.672783 0.035241 0.001493 0.75
4 0.209163 1.000000 0.041330 0.015752 0.230776 0.674638 0.038534 0.001198 0.75

Model Development

In this section I will use differnt machine learning algorithm to predict the sales for these stores

Performance Metric

I am going to use R square value as performance metric. The intution behind R square value is when R square is closer to 1 indicates that it's a better model

In [55]:
x=dt.drop('median_house_value',axis=1)
y=dt['median_house_value']

Splitting the dataset into train and test dataset

In [56]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=42)
In [57]:
x_train.shape,y_train.shape
Out[57]:
((14448, 8), (14448,))

Linear Regression

In [58]:
model_lr=LinearRegression()
model_lr.fit(x_train,y_train)
model_pred_tr=model_lr.predict(x_train)
model_pred_test=model_lr.predict(x_test)
In [59]:
print('R2 square for train',r2_score(y_train,model_pred_tr))
print('R2 square for test',r2_score(y_test,model_pred_test))
R2 square for train 0.5784798525280728
R2 square for test 0.5805454523877667

Observation

As I know if R square vale is close to 1 it indicate that it's a good model
From the above R square value for train and test I can see that both are not close to 1
For both train and test set my model is giving same value for R Square which indicate that model is not over fit

In [60]:
plt.scatter(y_test,model_pred_test)
plt.xlabel('Actual Sales ')
plt.ylabel('Predicted sales ')
Out[60]:
Text(0, 0.5, 'Predicted sales ')

Statsmodels

In [61]:
from statsmodels.formula.api import ols
In [62]:
feature = ' + '.join(dt.drop('median_house_value', axis = 1).columns)
'median_house_value ~ ' + feature
Out[62]:
'median_house_value ~ longitude + housing_median_age + total_rooms + population + median_income + Rooms_perHousehold + Population_per_household + ocean_proximity_labels'
In [63]:
mod = ols('median_house_value ~ ' + feature , data = dt)
# fit the model
lm = mod.fit()
lm.summary()
Out[63]:
OLS Regression Results
Dep. Variable: median_house_value R-squared: 0.579
Model: OLS Adj. R-squared: 0.579
Method: Least Squares F-statistic: 3549.
Date: Fri, 19 Feb 2021 Prob (F-statistic): 0.00
Time: 21:39:56 Log-Likelihood: -2.6094e+05
No. Observations: 20640 AIC: 5.219e+05
Df Residuals: 20631 BIC: 5.220e+05
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.805e+05 3.27e+04 11.650 0.000 3.16e+05 4.44e+05
longitude 3203.4072 273.346 11.719 0.000 2667.627 3739.187
housing_median_age 1177.0352 47.000 25.043 0.000 1084.911 1269.159
total_rooms 11.0004 0.542 20.299 0.000 9.938 12.063
population -16.8413 1.017 -16.553 0.000 -18.835 -14.847
median_income 3.862e+04 313.107 123.344 0.000 3.8e+04 3.92e+04
Rooms_perHousehold -2159.5436 237.971 -9.075 0.000 -2625.986 -1693.102
Population_per_household -177.1191 51.305 -3.452 0.001 -277.681 -76.557
ocean_proximity_labels 3.242e+04 619.935 52.295 0.000 3.12e+04 3.36e+04
Omnibus: 4529.557 Durbin-Watson: 0.808
Prob(Omnibus): 0.000 Jarque-Bera (JB): 12469.088
Skew: 1.169 Prob(JB): 0.00
Kurtosis: 6.005 Cond. No. 2.41e+05


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.41e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Observation

As I know if R square and adjusted R square values are close to each other means that input paramter's I am using are corelated to the target variable

Random Forest

In [64]:
model_rand=RandomForestRegressor()
model_rand.fit(x_train,y_train)
model_rand_train=model_rand.predict(x_train)
model_rand_test=model_rand.predict(x_test)
C:\Users\Subhasish Das\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
In [65]:
print('R2 square for train',r2_score(y_train,model_rand_train))
print('R2 square for test',r2_score(y_test,model_rand_test))
R2 square for train 0.9591290587410307
R2 square for test 0.7662555342011005

Observation

From Random forest model I am getting R2 square value more closer to 1 which indicate that this model is better than Linear Regression model

In [66]:
plt.scatter(y_test,model_rand_test)
plt.xlabel('Actual Sales ')
plt.ylabel('Predicted sales ')
Out[66]:
Text(0, 0.5, 'Predicted sales ')

Decision Tree

In [67]:
dct_mod=DecisionTreeRegressor()
dct_mod.fit(x_train,y_train)
dct_mod_train=dct_mod.predict(x_train)
dct_mod_test=dct_mod.predict(x_test)
In [68]:
print('R2 square for train',r2_score(y_train,dct_mod_train))
print('R2 square for test',r2_score(y_test,dct_mod_test))
R2 square for train 1.0
R2 square for test 0.5808326497982872

Observation

From Decision Tree Regressor model I am getting R2 square value .58 for test dataset which is less than the value I got from Random Forest

In [69]:
plt.scatter(y_test,dct_mod_test)
plt.xlabel('Actual Sales ')
plt.ylabel('Predicted sales ')
Out[69]:
Text(0, 0.5, 'Predicted sales ')

Support vector Regression

In [70]:
svr_mod=SVR()
svr_mod.fit(x_train,y_train)
svr_mod_train=dct_mod.predict(x_train)
svr_mod_test=dct_mod.predict(x_test)
C:\Users\Subhasish Das\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
In [71]:
print('R2 square for train',r2_score(y_train,svr_mod_train))
print('R2 square for test',r2_score(y_test,svr_mod_test))
R2 square for train 1.0
R2 square for test 0.5808326497982872
In [72]:
plt.scatter(y_test,svr_mod_test)
plt.xlabel('Actual Sales ')
plt.ylabel('Predicted sales ')
Out[72]:
Text(0, 0.5, 'Predicted sales ')

Now I will do the prediction using scalled data

In [73]:
x=dt_scaled.drop('median_house_value',axis=1)
y=dt_scaled['median_house_value']
In [74]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=42)
In [75]:
model_lr=LinearRegression()
model_lr.fit(x_train,y_train)
model_pred_tr=model_lr.predict(x_train)
model_pred_test=model_lr.predict(x_test)
In [76]:
print('R2 square for train',r2_score(y_train,model_pred_tr))
print('R2 square for test',r2_score(y_test,model_pred_test))
R2 square for train 0.5784798525280728
R2 square for test 0.5805454523877666
In [77]:
plt.scatter(y_test,model_pred_test)
plt.xlabel('Actual Sales ')
plt.ylabel('Predicted sales ')
Out[77]:
Text(0, 0.5, 'Predicted sales ')

Statsmodels

In [78]:
feature = ' + '.join(dt_scaled.drop('median_house_value', axis = 1).columns)
'median_house_value ~ ' + feature
Out[78]:
'median_house_value ~ longitude + housing_median_age + total_rooms + population + median_income + Rooms_perHousehold + Population_per_household + ocean_proximity_labels'
In [79]:
mod = ols('median_house_value ~ ' + feature , data = dt_scaled)
# fit the model
lm = mod.fit()
lm.summary()
Out[79]:
OLS Regression Results
Dep. Variable: median_house_value R-squared: 0.579
Model: OLS Adj. R-squared: 0.579
Method: Least Squares F-statistic: 3549.
Date: Fri, 19 Feb 2021 Prob (F-statistic): 0.00
Time: 21:41:02 Log-Likelihood: 9280.2
No. Observations: 20640 AIC: -1.854e+04
Df Residuals: 20631 BIC: -1.847e+04
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -0.0297 0.005 -5.450 0.000 -0.040 -0.019
longitude 0.0663 0.006 11.719 0.000 0.055 0.077
housing_median_age 0.1238 0.005 25.043 0.000 0.114 0.133
total_rooms 0.8918 0.044 20.299 0.000 0.806 0.978
population -1.2389 0.075 -16.553 0.000 -1.386 -1.092
median_income 1.1546 0.009 123.344 0.000 1.136 1.173
Rooms_perHousehold -0.6281 0.069 -9.075 0.000 -0.764 -0.492
Population_per_household -0.4538 0.131 -3.452 0.001 -0.711 -0.196
ocean_proximity_labels 0.2674 0.005 52.295 0.000 0.257 0.277
Omnibus: 4529.557 Durbin-Watson: 0.808
Prob(Omnibus): 0.000 Jarque-Bera (JB): 12469.088
Skew: 1.169 Prob(JB): 0.00
Kurtosis: 6.005 Cond. No. 160.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Random Forest

In [80]:
model_rand=RandomForestRegressor()
model_rand.fit(x_train,y_train)
model_rand_train=model_rand.predict(x_train)
model_rand_test=model_rand.predict(x_test)
C:\Users\Subhasish Das\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
In [81]:
print('R2 square for train',r2_score(y_train,model_rand_train))
print('R2 square for test',r2_score(y_test,model_rand_test))
R2 square for train 0.9567890534330203
R2 square for test 0.7644385471998768
In [82]:
plt.scatter(y_test,model_rand_test)
plt.xlabel('Actual Sales ')
plt.ylabel('Predicted sales ')
Out[82]:
Text(0, 0.5, 'Predicted sales ')

Decision Tree

In [83]:
dct_mod=DecisionTreeRegressor()
dct_mod.fit(x_train,y_train)
dct_mod_train=dct_mod.predict(x_train)
dct_mod_test=dct_mod.predict(x_test)
In [84]:
print('R2 square for train',r2_score(y_train,dct_mod_train))
print('R2 square for test',r2_score(y_test,dct_mod_test))
R2 square for train 0.999999961183257
R2 square for test 0.5830630013520243
In [85]:
plt.scatter(y_test,dct_mod_test)
plt.xlabel('Actual Sales ')
plt.ylabel('Predicted sales ')
Out[85]:
Text(0, 0.5, 'Predicted sales ')

Fine Tune my model

In [86]:
dt_housing.columns
Out[86]:
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'City', 'State', 'admin2', 'Rooms_perHousehold',
       'bedrooms_per_room', 'Population_per_household',
       'ocean_proximity_labels'],
      dtype='object')
In [87]:
dt_housing.drop(['City', 'State', 'admin2'],axis=1,inplace=True)
In [88]:
param_grid=[{'n_estimators':[3,10,30],'max_features':[2,3,4]},
            {'bootstrap':[False],'n_estimators':[3,10],'max_features':[2,3,4]}
           ]
In [89]:
x_housing=dt_housing.drop(['median_house_value'],axis=1)
y_housing=dt_housing['median_house_value']
In [90]:
min_train = x_housing.min()
min_train
Out[90]:
longitude                  -124.350000
latitude                     32.540000
housing_median_age            1.000000
total_rooms                   2.000000
total_bedrooms                1.000000
population                    3.000000
households                    1.000000
median_income                 0.499900
Rooms_perHousehold            0.846154
bedrooms_per_room             0.037151
Population_per_household      0.692308
ocean_proximity_labels        0.000000
dtype: float64
In [91]:
range_train = (x_housing-min_train).max()
range_train
Out[91]:
longitude                      10.040000
latitude                        9.410000
housing_median_age             51.000000
total_rooms                 39318.000000
total_bedrooms               6444.000000
population                  35679.000000
households                   6081.000000
median_income                  14.500200
Rooms_perHousehold            141.062937
bedrooms_per_room               2.787524
Population_per_household     1242.641026
ocean_proximity_labels          4.000000
dtype: float64
In [92]:
x_housing_scaled = (x_housing-min_train)/range_train
x_housing_scaled
Out[92]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income Rooms_perHousehold bedrooms_per_room Population_per_household ocean_proximity_labels
0 0.211155 0.567481 0.784314 0.022331 0.019863 0.008941 0.020556 0.539668 0.043512 0.039261 0.001499 0.75
1 0.212151 0.565356 0.392157 0.180503 0.171477 0.067210 0.186976 0.538027 0.038224 0.042563 0.001141 0.75
2 0.210159 0.564293 1.000000 0.037260 0.029330 0.013818 0.028943 0.466028 0.052756 0.033135 0.001698 0.75
3 0.209163 0.564293 1.000000 0.032352 0.036313 0.015555 0.035849 0.354699 0.035241 0.052845 0.001493 0.75
4 0.209163 0.564293 1.000000 0.041330 0.043296 0.015752 0.042427 0.230776 0.038534 0.048410 0.001198 0.75
5 0.209163 0.564293 1.000000 0.023323 0.032899 0.011491 0.031574 0.243921 0.027757 0.069819 0.001165 0.75
6 0.209163 0.563231 1.000000 0.064423 0.075729 0.030578 0.084361 0.217873 0.028964 0.055873 0.001156 0.75
7 0.209163 0.563231 1.000000 0.078895 0.106456 0.032344 0.106233 0.180694 0.028011 0.066072 0.000882 0.75
8 0.208167 0.563231 0.803922 0.064932 0.103042 0.033717 0.097681 0.108998 0.024443 0.080043 0.001074 0.75
9 0.209163 0.563231 1.000000 0.090213 0.109559 0.043387 0.117250 0.220087 0.029238 0.058138 0.001191 0.75
10 0.208167 0.564293 1.000000 0.055954 0.067194 0.025421 0.065943 0.186425 0.032833 0.057378 0.001265 0.75
11 0.208167 0.564293 1.000000 0.089043 0.116543 0.042070 0.120539 0.191073 0.027834 0.063685 0.001092 0.75
12 0.208167 0.564293 1.000000 0.063304 0.073402 0.030690 0.076797 0.177591 0.031734 0.054936 0.001331 0.75
13 0.208167 0.563231 1.000000 0.017651 0.029485 0.009585 0.028449 0.149908 0.022358 0.085120 0.001038 0.75
14 0.208167 0.564293 1.000000 0.067170 0.096989 0.033885 0.101792 0.097709 0.024221 0.071641 0.001016 0.75
15 0.208167 0.564293 0.960784 0.028435 0.043762 0.019451 0.043249 0.112074 0.024076 0.077319 0.001568 0.75
16 0.207171 0.564293 1.000000 0.049952 0.053693 0.022142 0.054267 0.156901 0.036107 0.049990 0.001371 0.75
17 0.207171 0.564293 1.000000 0.031182 0.045313 0.018078 0.049663 0.111743 0.022732 0.072268 0.001164 0.75
18 0.208167 0.563231 0.960784 0.056895 0.070453 0.027663 0.068739 0.102840 0.031883 0.059574 0.001344 0.75
19 0.207171 0.563231 1.000000 0.038176 0.046089 0.019255 0.045058 0.145060 0.032746 0.057800 0.001462 0.75
20 0.207171 0.564293 0.764706 0.019050 0.028399 0.011379 0.027134 0.059165 0.026073 0.074566 0.001426 0.75
21 0.207171 0.564293 0.803922 0.041635 0.056797 0.025954 0.060023 0.083695 0.025747 0.067001 0.001486 0.75
22 0.207171 0.563231 1.000000 0.061905 0.083799 0.028364 0.078441 0.084488 0.030129 0.066344 0.001152 0.75
23 0.207171 0.563231 1.000000 0.042881 0.052142 0.023824 0.053281 0.115909 0.030821 0.058293 0.001555 0.75
24 0.207171 0.563231 1.000000 0.056514 0.067660 0.028112 0.069232 0.144832 0.031362 0.057163 0.001361 0.75
25 0.206175 0.564293 0.784314 0.013556 0.018932 0.008801 0.019405 0.131302 0.025872 0.069149 0.001587 0.75
26 0.206175 0.564293 0.941176 0.028689 0.037709 0.016929 0.039138 0.135157 0.027519 0.064135 0.001487 0.75
27 0.206175 0.564293 1.000000 0.048222 0.065177 0.030802 0.065121 0.090213 0.027893 0.066246 0.001677 0.75
28 0.206175 0.563231 0.960784 0.052902 0.076195 0.031615 0.077619 0.078792 0.025205 0.071447 0.001367 0.75
29 0.206175 0.563231 1.000000 0.018490 0.024674 0.010987 0.025325 0.081902 0.027343 0.065409 0.001494 0.75
... ... ... ... ... ... ... ... ... ... ... ... ...
20610 0.277888 0.697131 0.529412 0.054123 0.074953 0.033409 0.072028 0.059530 0.028397 0.068189 0.001633 0.00
20611 0.278884 0.697131 0.509804 0.045297 0.068281 0.032512 0.067094 0.054192 0.024906 0.075402 0.001731 0.00
20612 0.277888 0.695005 0.490196 0.034971 0.044693 0.021245 0.043743 0.068516 0.030562 0.061964 0.001737 0.00
20613 0.278884 0.696068 0.588235 0.043898 0.056487 0.032624 0.062983 0.068682 0.025902 0.062448 0.001889 0.00
20614 0.279880 0.695005 0.490196 0.057836 0.071229 0.040696 0.077783 0.135833 0.028041 0.059177 0.001913 0.00
20615 0.279880 0.695005 0.431373 0.027316 0.033364 0.020208 0.032232 0.128267 0.032721 0.058687 0.002400 0.00
20616 0.280876 0.695005 0.274510 0.045984 0.068281 0.032344 0.061503 0.106688 0.028218 0.074078 0.001926 0.00
20617 0.280876 0.692880 0.372549 0.014217 0.016760 0.008548 0.018582 0.193253 0.028887 0.056374 0.001617 0.00
20618 0.278884 0.692880 0.470588 0.033827 0.038175 0.020264 0.037000 0.120695 0.035783 0.053196 0.002028 0.00
20619 0.277888 0.687566 0.411765 0.048044 0.052607 0.028588 0.048512 0.153819 0.039290 0.051174 0.002224 0.00
20620 0.285857 0.691817 0.764706 0.004985 0.006207 0.004148 0.007729 0.280175 0.023244 0.060957 0.001974 0.00
20621 0.286853 0.687566 0.705882 0.031589 0.038175 0.013481 0.025654 0.128702 0.050172 0.057902 0.001924 0.00
20622 0.289841 0.686504 0.372549 0.019152 0.022657 0.012725 0.025654 0.132191 0.028092 0.056520 0.001785 0.00
20623 0.296813 0.689692 0.607843 0.029401 0.037709 0.016676 0.037165 0.160246 0.030165 0.062262 0.001563 0.00
20624 0.292829 0.690755 0.294118 0.043135 0.046400 0.020404 0.047690 0.177515 0.035366 0.050054 0.001464 0.00
20625 0.281873 0.699256 0.705882 0.002543 0.002483 0.000729 0.002138 0.250003 0.045650 0.046463 0.001110 0.00
20626 0.290837 0.705632 0.686275 0.028537 0.028399 0.014042 0.027956 0.114950 0.040599 0.045399 0.001815 0.00
20627 0.301793 0.700319 0.078431 0.009054 0.009932 0.004653 0.009538 0.172418 0.037016 0.051807 0.001748 0.00
20628 0.285857 0.697131 0.352941 0.051910 0.065177 0.028448 0.063970 0.144501 0.031137 0.060598 0.001543 0.00
20629 0.294821 0.699256 0.529412 0.255176 0.287865 0.193643 0.298800 0.109957 0.033132 0.053023 0.002502 0.00
20630 0.301793 0.717322 0.196078 0.067094 0.078212 0.035147 0.073014 0.211542 0.036058 0.055295 0.001716 0.00
20631 0.293825 0.721573 0.274510 0.067475 0.076350 0.033549 0.070877 0.208135 0.037570 0.053286 0.001678 0.00
20632 0.288845 0.714134 0.274510 0.058930 0.064401 0.029261 0.063148 0.181039 0.036702 0.051026 0.001631 0.00
20633 0.280876 0.706695 0.509804 0.052851 0.063780 0.030242 0.062654 0.141350 0.032602 0.057731 0.001722 0.00
20634 0.277888 0.715197 0.529412 0.059260 0.061142 0.029093 0.056405 0.221556 0.042059 0.047437 0.001878 0.00
20635 0.324701 0.737513 0.470588 0.042296 0.057883 0.023599 0.054103 0.073130 0.029769 0.067255 0.001503 0.00
20636 0.312749 0.738576 0.333333 0.017676 0.023122 0.009894 0.018582 0.141853 0.037344 0.063876 0.001956 0.00
20637 0.311753 0.732200 0.313725 0.057277 0.075109 0.028140 0.071041 0.082764 0.030904 0.063864 0.001314 0.00
20638 0.301793 0.732200 0.333333 0.047256 0.063315 0.020684 0.057227 0.094295 0.031783 0.065557 0.001152 0.00
20639 0.309761 0.725824 0.294118 0.070782 0.095438 0.038790 0.086992 0.130253 0.031252 0.066021 0.001549 0.00

20640 rows × 12 columns

In [93]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x_housing_scaled,y_housing,test_size=0.2,random_state=42)
In [94]:
x_train.shape,y_train.shape
Out[94]:
((16512, 12), (16512,))
In [95]:
from sklearn.model_selection import GridSearchCV
In [96]:
grid_RandomForest = GridSearchCV(RandomForestRegressor(),param_grid)
In [97]:
grid_RandomForest.fit(x_train,y_train)
C:\Users\Subhasish Das\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)
Out[97]:
GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                             max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators='warn', n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid=[{'max_features': [2, 3, 4],
                          'n_estimators': [3, 10, 30]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [98]:
grid_RandomForest.best_params_
Out[98]:
{'max_features': 4, 'n_estimators': 30}
In [99]:
gridtrain_RandomForest_predict = grid_RandomForest.predict(x_train)
In [100]:
gridtest_RandomForest_predict = grid_RandomForest.predict(x_test)
In [101]:
print('R2 square for train',r2_score(y_train,gridtrain_RandomForest_predict))
print('R2 square for test',r2_score(y_test,gridtest_RandomForest_predict))
R2 square for train 0.9721365715033597
R2 square for test 0.8094654175988255
In [102]:
plt.scatter(y_test,gridtest_RandomForest_predict)
plt.xlabel('Actual Sales ')
plt.ylabel('Predicted sales ')
Out[102]:
Text(0, 0.5, 'Predicted sales ')
In [103]:
parameters_lr = {'fit_intercept':[True,False], 'normalize':[True,False], 'copy_X':[True, False]}
In [104]:
gridlr = GridSearchCV(LinearRegression(),parameters_lr, cv=5)
In [105]:
gridlr.fit(x_train,y_train)
Out[105]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LinearRegression(copy_X=True, fit_intercept=True,
                                        n_jobs=None, normalize=False),
             iid='warn', n_jobs=None,
             param_grid={'copy_X': [True, False],
                         'fit_intercept': [True, False],
                         'normalize': [True, False]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [106]:
gridlr.best_score_
Out[106]:
0.6500265109973369
In [107]:
gridlr_tr = gridlr.predict(x_train)
gridlr_test = gridlr.predict(x_test)
In [108]:
print('R2 square for train',r2_score(y_train,gridlr_tr))
print('R2 square for test',r2_score(y_test,gridlr_test))
R2 square for train 0.6526944611603457
R2 square for test 0.5860794349444838
In [ ]: